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Abstract — The aim of this paper is to synthesize and 
implement an algorithm to compute square root efficiently and 
cost effectively. Square root is a computation required in many 
mathematical problems required in computer multimedia 
communication and in many space related data processing. So, 
there is a requirement to develop this operation efficiently. 
Hence, in this paper a new type of algorithm is used to design a 
core for finding square root. The core implemented in this paper 
uses a modified non restoring division algorithms to find square 
root. Three types of structures have been developed namely: 
basic combinational, iterative and pipeline. Basic combinatorial 
is simple implementation of non-restoring division algorithm, it 
is nothing but single stage of pipelined architecture. This 
architecture can be used when cost is the major factor and speed 
can be compromised. Iterative architecture is hardware 
efficient cost effective architecture where single hardware unit 
is used iteratively for a computation. Pipelined architecture is 
the fastest architecture to be implemented in this paper. It uses 
various stages for computation i.e. parallel execution is 
performed and hence speeding up the execution time and 
process. Pipelined architecture can be used in real time 
processing system where speed is the major factor. These three 
structures are developed to compare various parameters like 
speed, cost, reliability and distinguish them for various 
application suited by them. The core is developed for any FPGA 
processor and is simulated and debugged using XILINX ISE 
14.1. The architecture is implemented onto VIRTEX family and 
debugged on Spartan 3 XC3S400TQ144. 

Index Terms — Combinatorial, FPGA, Iterative, Pipelined, 
Spartan, Virtex 


I. Introduction 

Square root is an operation required by system graphics and 
scientific computation applications such as math 
coprocessors, DSP algorithms, data processing and control 
[1]. Hence, it is an important computation that need to be 
enhanced. In 1996, Lu and Chi [1] have proposed a ‘new 
non-restoring square root algorithm’ for VLSI 
implementation, which is better than the existing VLSI 
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algorithms for computing square root. In many VLSI real 
time image processing applications, it is high prioritized 
requirement to provide the computation of square root of a 
binary coded number with low power dissipation and fast 
computation (low delay propagation). Square root calculation 
is one of the most useful and vital operations in computer 
graphics and scientific calculation applications, such as 
digital signal processing (DSP) algorithms, math 

coprocessor, data processing and control, and even 
multimedia applications [1-6]. It is a classical problem in 
computational number theory, which is oftenly encountered 
and which is a hard task to get an exact result [7-8]. 

The paper is divided as follows: Section II describes the 
algorithm. Section III presents the implemented 

architectures. Section IV explains the results and analysis, 
and in the results a detailed comparison between the spartan 
core and Virtex core is presented. Finally, conclusions is 
given. 


II. Non Restoring Algorithm 

The focus of the previous restoring and non-restoring 
algorithms is on each bit of the square root with each 
iteration. In this section, non-restoring square root algorithm 
has been described as in [1]. Each operation consists of 
addition or subtraction based on the sign of the result of 
previous operation. The partial remainder generated in each 
iteration is used in the next iteration even it is negative [1]. At 
the final iteration, if the partial remainder is not negative, it 
becomes the final precise remainder. 


Radical: ‘D’ of ‘2n bits. Square root: ‘Q’ of ‘n’ bits: 



dj. — ^2*1— 2 — -Ofc* -fc iL 2?i — 1 

% = Q*-iQ *-2 -Q t .k<n-1 

*Note that qk has ‘n-k’ bits 

r 'n-l t^ 2 n -iD 2n -2 — 01 

a > o 

if r'n-i < 0 


For k = n-2 downto 0 do 
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+ flk+iOl/ if Qk+i — 1 

V k + I^2JH-l^k + + Qk + l = 0 


1, ifr\> 0 
0, if r\ < 0 


End 


Remainder R = ro 



; rf o if r 'o — o 

y 0 + fli 01,i/r' o 


< 0 


At each iteration, q k (the square root of is computed. 
Since d 2k = d 2(k+1) D 2k+1 D 2h that is D 2k+1 D 2k is attached to 
d 2 (k+i) to form d 2k , it can be inferred that D 2k+ iD 2k must be 
used to get q k . That explains the fact the algorithm attaches 
D 2 k+iD 2 k to r \+i to form r \ in order to get q k . The remainder 
at each iteration, called r k , has ‘n-k+1 ’ bits, one more bit than 
q k [1]: r k = R n Rn-iRn -2 ••• Rb But the algorithm uses an 
estimated remainder, called r’ k , that has ‘n-k+2’ bits, the 
MSB is the sign bit, which decides the value of Q k , and it can 
be demonstrated that only the ‘n-k+1’ least significant bits of 
r\ are used to get the next estimated remainder r’ k -i- Also, in 
order to get the real remainder R = r 0 , only the ‘n+1 ’ LSBs of 
r’ 0 are needed (the MSB determines Q 0 ). It lessens the gate 
count, since a register of only ‘n-k+1 ’ bits is needed for r’ k . 


III. Architectures 

As mentioned in abstract three types of architectures have 
been implemented which will be described in this section. 


A. Pipelined 

To implement this architecture we need to unfold the 
algorithm explained in section II. Therefore ‘n’ stages with 
‘n’ adders/subtractors will appear. By observing the first 
iteration, a reduction is obtained: 




R’n-1<- D2n-lD2n-2 - 01 

Qn-i k if r ’„-i > 0 Figure 1 : Pipelined architecture 

Qn-i 0, if r < 0 


There is no need to perform the first subtraction and wait 
one cycle, if the result from the first iteration can be 
obtained directly from the first 2 MSBs ofD. So the first 
stage can be embedded into the second stage, and there will 
be ‘n-1’ pipeline stages. 

This architecture is depicted in Figure 1 . The computation 
of the remainder is not considered, although the core 
computes it if the user wants. Note that the dotted rectangles 
indicate the registers that would have appeared if the 
reduction of the first stage hadn’t been performed. Such 
architecture can obtain a new square root each cycle. The 
initial latency is ‘n’ cycles. 


The longest path delay occurs in the last stage, because the 
adder/substractor increases in size as stages advance. A 
further improvement can be made if the last stages are 
pipelined, and the initial ones merged. 

B. Combinatorial Architecture 

This architecture is implemented because some non-real 
time applications need it, and also in order to establish a 
comparison with the core that does have a fully-combinatorial 
architecture. The architecture is very simple: It is the 
fully-pipelined architecture without the pipelining registers. 
It only has one register at the input and one at the output. 
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Figure 2: Combinatorial Architecture 
C. Iterative Architecture 

The size of the elements (registers, adder/subtractor) will be 
the size of the last stage of the pipelined architecture: 
Register R ‘n+1 ’ bits Register Q ‘n’ bits Adder / subtractor 
‘n+2’ bits. 

Since all iterations are embedded in one stage, the reduction 
of Section III A cannot be used. 


This reduces the width of the adder/subtractor by 2 bits. The 
result ‘ba’ is obtained in parallel and the carry-in comes 
from just an OR gate. So the new adder/subtractor uses ‘n’ 
bits and has carry-in. Also, note that the MSB of the second 
operator of the adder/subtractor is ‘0’ as in the pipelined 
case. Figure 3 depicts this architecture. 




Figure 3: Iterative Architecture 


But a simplification for this case exists: 

In the adder/subtractor: the 2 LSBs performs either ‘xy-01’ 
or ‘xy+1 1 ’, ‘xy’ is the pair of D bits used at each step. The 
operation yields: ‘cba’. The truth table is shown: 


cba = 

xy + 11 

xy 

cba 

00 

Oil 

01 

100 

00 

101 

01 

110 


cba = xy 
xy 

-01 

cba 

00 

111 

01 

000 

00 

001 

01 

010 


51 

reseln = 0 

R -e— 0. D — D_i. z n-1 

P] 

( Q< 

S 2 

X 0 


Z «-Z - 1 



Figure 4: FSM for iterative architecture 


‘C’: carry-in for the next stage of the adder/subtractor ‘ba’: 
result of the operation. 

‘Ba’ depends only on ‘xy’, but ‘c’ depends on the type of 
operation. Luckily, a conventional adder/sub stractor with 
carry-in (e.g. the lpm_add_sub megafuntion) treats the 
carry-in as positive logic when adding, and as negative 
logic when subtracting [3] (this is done to reduce gates 
usage). So, for subtraction, we have to invert ‘c’ to assure 
the proper working of the adder/subtractor. The new truth 
table is: 


cba = 
xy 

xy + 11 

cba 

cba = xy 
xy 

-01 

cba 

00 

Oil 

00 

011 

01 

100 

01 

100 

00 

101 

00 

101 

01 

110 

01 

110 


Finite state machine of iterative architecture is depicted in 
figure 4. This FSM controls the iterative architecture. The 
process start when s = 1 .After ‘n’ clock cycles, the result is 
obtained in register Q, done = 1 , and a new process can be 
started. 

IV. Results 

The architecture were synthesized using XILINX ISE 
vl4.1 successfully. After synthesizing the core were 
implemented on FPGA device XC3S400-TQ144 (Xilinx 
Spartan-3 family) with speed grade -5. The core presented 
does not compute the remainder, since it is rarely used. 
Figure 5 depicts the core with all its options. Table 1 
establishes a comparison between this core and the 
ALTERA core. 

Results are shown only for a specific device (Spartan 3) 
because of large results data with just one device and these 
results are enough to demonstrate the benefits of the core 
implemented. 


Now, ‘c’ and ‘ba’ depends only on ‘xy’: 

c = x + y b = (x ® y) a - y 
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Figure 5 : Parameter comparison graph for Spartan 


Figure 5: Parameter comparison graph for Virtex 
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a. Combinatorial 


b. pipelined 


c. Iterative 


Figure 7: Area comparison of all three architecture for SPARTAN 3 





a. C ombinatorial 


b. Pipelined 


c. Iterative 


Figure 7 : Area comparison of all three architecture for VIRTEX 4 


Result analysis: 

From the graphs shown in the figure 5 and 6 we can infer 
that latency is more in combinatorial. But in pipeline the 
latency till 16 bits are low than iterative. After 16 bits 
iterative and pipeline latency time crossover: i.e. for large 
input bits latency increases in pipeline. Hence efficiency in 
pipeline architecture is the highest. Maximum frequency 
operable is highest in pipelined architecture. 

Area Comparison: 

As shown in figure 7 and 8 area covered by all three 
architecture can be seen. Plan Ahead tool of XILINX was 
used to generate area and compare them. We can infer from 
figure 7 and 8 that pipeline architecture uses more area and 
iterative uses the least area. 


V. Conclusion 

The core implemented achieves high speed at minimum 
cost since it only uses only an adder/subtractor unit to 
perform the operations. The architecture is very flexible, so 
that the user can choose the best architecture for his 
application. The efficiency of this core can be observed from 
graph and table. 

The results are better in terms of speed and resource effort 
than the earlier implementation. An improvement i.e. 
simplification for the iterative architecture can be applied to 
each stage of the pipelined architecture. 
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